Week 3 - Simple and multiple linear regression

PSC 103B

Marwin Carmo

Today’s dataset

We can access this dataset by installing the palmerpenguins package.

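# Install the package once, then load it in each new session
install.packages("palmerpenguins")
library(palmerpenguins)
# glimpse() comes from dplyr, which must also be installed
dplyr::glimpse(penguins)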
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Today’s dataset

[Figure: Bill dimensions]

Two-Sample t-test

  • Outcome variable: bill_length_mm

  • Bill length was not recorded for every penguin, so this variable has some missing values.

  • The complete.cases() function returns a logical vector with TRUE for every row that has no missing value on the variable you give it. Indexing the rows with that vector keeps only the complete ones.

# Keep only the rows where bill length is not missing
penguins_subset <- penguins[complete.cases(penguins$bill_length_mm), ]
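
To see what complete.cases() actually returns, here is a minimal sketch on a small made-up data frame. The name toy and its values are hypothetical and are not taken from the penguins data.

```r
# Toy data frame (hypothetical values, not the penguins data)
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3, 36.7),
  sex            = c("male", "female", NA, "female")
)

# complete.cases() returns one TRUE/FALSE per row, not row numbers
keep <- complete.cases(toy$bill_length_mm)
keep

# which() converts the logical vector to row numbers if you need them
which(keep)

# Indexing with the logical vector drops only the row missing bill length;
# the row with a missing sex is kept because we checked a single column
toy_subset <- toy[keep, ]
nrow(toy_subset)
```

Because only bill_length_mm is checked, rows with a missing sex stay in the subset, which is why the later calls to mean() still use na.rm = TRUE.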

Two-Sample t-test

  • Suppose we are interested in whether male and female penguins differ in bill length.

  • We suspect that male penguins have longer bills than female penguins.

  • Let’s look at both means:

# Average bill length for males
mean(penguins_subset$bill_length_mm[penguins_subset$sex == "male"],
     na.rm = TRUE)
[1] 45.85476
# Average bill length for females
mean(penguins_subset$bill_length_mm[penguins_subset$sex == "female"],
     na.rm = TRUE)
[1] 42.09697

Two-Sample t-test

  • Another way to do this is to use the tapply() function.

  • tapply(variable, group, function, extra arguments for the function)

tapply(penguins_subset$bill_length_mm, 
       penguins_subset$sex, mean, na.rm = TRUE)
  female     male 
42.09697 45.85476 
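
The same pattern works with any summary function, not only mean(). As a rough sketch on made-up data (the data frame toy and its values are hypothetical), swapping in sd gives the group standard deviations:

```r
# Toy data (hypothetical values, not the penguins data)
toy <- data.frame(
  bill_length_mm = c(40, 42, NA, 44, 46, 48),
  sex = factor(c("female", "female", "female", "male", "male", "male"))
)

# Extra arguments after the function (here na.rm = TRUE) are passed on to it
group_means <- tapply(toy$bill_length_mm, toy$sex, mean, na.rm = TRUE)
group_sds   <- tapply(toy$bill_length_mm, toy$sex, sd, na.rm = TRUE)

group_means
group_sds
```

The result is a named vector, so a single group can be pulled out with, for example, group_means[["female"]].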

Two-Sample t-test

  • Is the observed difference of about 3.8 mm statistically significant?

  • \(H_0: \mu_{female} = \mu_{male}\), or the average bill length of females is the same as the average bill length of males.

  • \(H_1: \mu_{female} < \mu_{male}\), or the average bill length of females is less than that of males.

  • The t-test asks whether the difference you observed between the groups is large relative to how much that difference is expected to vary from sample to sample (its standard error).
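
To make that idea concrete, the Welch t statistic can be written as \(t = (\bar{x}_{female} - \bar{x}_{male}) / \sqrt{s^2_{female}/n_{female} + s^2_{male}/n_{male}}\). The sketch below computes it by hand and checks it against t.test(). The two vectors are made-up values chosen for illustration, not the penguins data.

```r
# Made-up bill lengths for two small groups (hypothetical values)
female <- c(40.1, 42.3, 41.5, 43.0, 39.8)
male   <- c(45.2, 44.1, 47.3, 46.0, 48.4, 45.5)

# Numerator: the observed difference in means (Group 1 minus Group 2)
diff_means <- mean(female) - mean(male)

# Denominator: the standard error of that difference (Welch version,
# each group contributes its own variance divided by its own n)
se_diff <- sqrt(var(female) / length(female) + var(male) / length(male))

t_by_hand <- diff_means / se_diff

# t.test() runs Welch's test by default, so the two values should agree
t_builtin <- unname(t.test(female, male)$statistic)
all.equal(t_by_hand, t_builtin)
```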

Two-Sample t-test

  • Our hypothesis was that females have shorter bill lengths than males.

  • R treats females as Group 1 and males as Group 2, because factor levels are ordered alphabetically by default and female comes before male. The alternative hypothesis must be stated as Group 1 compared to Group 2.

  • Using the syntax on the next slide, replace the placeholders with the names of the variables we’re interested in.
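
If you are unsure which group R treats as Group 1, levels() shows the order. The sketch below uses a made-up data frame (toy and the column sex_male_first are hypothetical names) and the default two-sided test, only to show that the sign of the t statistic follows Group 1 minus Group 2.

```r
# Toy data (hypothetical values, not the penguins data)
toy <- data.frame(
  bill_length_mm = c(40.1, 42.3, 41.5, 43.0, 45.2, 44.1, 47.3, 46.0),
  sex = factor(rep(c("female", "male"), each = 4))
)

# The first level listed is Group 1
levels(toy$sex)

# Default (two-sided) test: the estimates are printed in level order
res <- t.test(bill_length_mm ~ sex, data = toy)
res$estimate
unname(res$statistic)   # negative: Group 1 mean is below Group 2 mean

# Making male the first level flips the sign of the statistic
toy$sex_male_first <- relevel(toy$sex, ref = "male")
res_flipped <- t.test(bill_length_mm ~ sex_male_first, data = toy)
unname(res_flipped$statistic)
```

Changing the reference level does not change the data, only the direction in which the difference is expressed, so the choice of alternative has to match the level order.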

Now you try

t.test(dependent_variable ~ group_variable, data = dataset,
       alternative = "???")

Tip

The alternative argument specifies the alternative hypothesis and can take one of three values: "two.sided", "less", or "greater". Think about our hypothesis to choose one of them.

Now you try

t.test(bill_length_mm ~ sex, data = penguins_subset, alternative = "less")

    Welch Two Sample t-test

data:  bill_length_mm by sex
t = -6.6725, df = 329.29, p-value = 5.332e-11
alternative hypothesis: true difference in means between group female and group male is less than 0
95 percent confidence interval:
     -Inf -2.82883
sample estimates:
mean in group female   mean in group male 
            42.09697             45.85476 

Write up the results

A Welch two-sample t-test found that female penguins (M = 42.1, SD = 4.90) have, on average, shorter bill lengths than male penguins (M = 45.9, SD = 5.37), t(329.29) = -6.67, p < .001.
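
The numbers in a write-up like this can be pulled from the object that t.test() returns instead of being copied from the printout. The following is one possible sketch on made-up data. The data frame toy, its values, and the exact sentence layout are assumptions for illustration.

```r
# Toy data (hypothetical values, not the penguins data)
toy <- data.frame(
  bill_length_mm = c(40.1, 42.3, 41.5, 43.0, 45.2, 44.1, 47.3, 46.0),
  sex = factor(rep(c("female", "male"), each = 4))
)

res <- t.test(bill_length_mm ~ sex, data = toy, alternative = "less")

# Group means and SDs for the descriptive part of the sentence
m <- tapply(toy$bill_length_mm, toy$sex, mean)
s <- tapply(toy$bill_length_mm, toy$sex, sd)

# Report "p < .001" for very small p-values, otherwise three decimals
p_text <- if (res$p.value < 0.001) "p < .001" else sprintf("p = %.3f", res$p.value)

# res$parameter holds the degrees of freedom, res$statistic the t value
report <- sprintf(
  "females (M = %.1f, SD = %.2f), males (M = %.1f, SD = %.2f), t(%.2f) = %.2f, %s",
  m[["female"]], s[["female"]], m[["male"]], s[["male"]],
  res$parameter, res$statistic, p_text
)
report
```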

Before we move on…

  • Notice that R runs Welch’s t-test by default.

  • Welch’s t-test does not assume that the two groups have equal variances, and it also works when the group sizes differ. Not assuming equal variances is usually the safer choice.

  • If you want to assume equal variances, set the argument var.equal = TRUE.
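
As a quick comparison, the sketch below runs both versions on the same made-up data (toy and its values are hypothetical). The most visible difference is in the degrees of freedom: the equal-variance test uses n1 + n2 - 2, while Welch's test uses an adjusted, usually fractional value.

```r
# Toy data with unequal group sizes (hypothetical values)
toy <- data.frame(
  bill_length_mm = c(40.1, 42.3, 41.5, 43.0, 39.8,
                     45.2, 44.1, 47.3, 46.0, 48.4, 45.5),
  sex = factor(c(rep("female", 5), rep("male", 6)))
)

# Default: Welch's t-test (variances not assumed equal)
welch <- t.test(bill_length_mm ~ sex, data = toy)

# Classic Student's t-test (variances assumed equal)
student <- t.test(bill_length_mm ~ sex, data = toy, var.equal = TRUE)

welch$method
student$method

# Degrees of freedom: adjusted for Welch, n1 + n2 - 2 = 9 for Student
unname(welch$parameter)
unname(student$parameter)
```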